In this section, we implement the evaluation of all possible bivariate models, where each model uses a single predictor to predict the target variable GPP. R² and AIC are computed for every model, and the goal is to identify the best-fitting one.
Main Implementation Function
process_models <- function(df, variables, variable_metadata, target = "GPP_NT_VUT_REF") {
  # Initialize storage objects
  models <- list()      # Stores fitted linear models
  var_names <- list()   # Stores variable names for display
  formulas <- list()    # Stores model formulas
  AICs <- numeric()     # Stores AIC values (named by predictor)
  R2s <- numeric()      # Stores R-squared values (named by predictor)
  plots <- list()       # Stores diagnostic plots
  best_model <- NULL    # Name of the best performing predictor so far

  # Process each predictor variable
  for (var in variables) {
    # Prepare data: select target and predictor, remove missing values
    df_model <- df |>
      dplyr::select(dplyr::all_of(c(target, var))) |>
      na.omit()

    # Create and fit linear model
    formula <- as.formula(paste(target, "~", var))
    model <- lm(formula, data = df_model)

    # Store model components
    formulas[[var]] <- formula
    models[[var]] <- model
    AICs[[var]] <- AIC(model)
    R2s[[var]] <- summary(model)$r.squared

    # Extract metadata for labeling
    var_index <- which(variable_metadata$variable == var)
    var_name <- sub(",.*", "", variable_metadata$description[var_index])
    var_expr <- paste0(var_name, " (", variable_metadata$units[var_index], ")")

    # Generate diagnostic plot (plot_bi() is a helper assumed to be defined elsewhere)
    plots[[var]] <- plot_bi(df_model, var, var_expr)
    var_names[[var]] <- var_name

    # Print progress information
    message("Processed: ", var,
            " | R² = ", round(R2s[[var]], 3),
            " | AIC = ", round(AICs[[var]], 1))

    # Update best model if current model has a higher R-squared
    if (is.null(best_model) || R2s[[var]] > R2s[[best_model]]) {
      best_model <- var
    }
  }

  # Compile best model information
  best_model_info <- list(
    model = best_model,
    R2 = R2s[[best_model]],
    AIC = AICs[[best_model]]
  )

  # Return all results
  return(list(
    formulas = formulas,
    models = models,
    AICs = AICs,
    R2s = R2s,
    plots = plots,
    var_names = var_names,
    best_model_info = best_model_info
  ))
}
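The core of the loop can be tried in isolation. The sketch below is a stripped-down illustration, not the function above: it uses base R only, leaves out the metadata lookup and plotting, and runs on simulated data. The names toy, x1, x2, y, r2, aic and ranking are made up for this example.

```r
# Simulated data; column names are illustrative, not from the original dataset
set.seed(42)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)   # y depends on x1 only
toy$x2[1:5] <- NA                # a few missing values in one predictor

predictors <- c("x1", "x2")
r2  <- numeric()   # named vector of R-squared values
aic <- numeric()   # named vector of AIC values

for (v in predictors) {
  # Keep only the target and the current predictor, drop incomplete rows
  d   <- na.omit(toy[, c("y", v)])
  fit <- lm(reformulate(v, response = "y"), data = d)
  r2[v]  <- summary(fit)$r.squared
  aic[v] <- AIC(fit)
}

# Rank the candidates by R-squared, best first
ranking <- data.frame(variable = names(r2), R2 = unname(r2), AIC = unname(aic))
ranking <- ranking[order(-ranking$R2), ]
best <- as.character(ranking$variable[1])
print(ranking)
```

Because y is simulated from x1 alone, x1 ends up at the top of the ranking. Collecting the metrics in named vectors, as process_models does, makes it straightforward to build such a summary table afterwards.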
df_clean (passed as the df argument): The dataset that contains the target variable (GPP_NT_VUT_REF) and the predictors.
predictors (passed as the variables argument): The predictor variable names to evaluate; one bivariate model is fitted for each.
lm(formula, data = df_model): This fits a linear regression model to predict the target variable based on the selected predictor.
R² and AIC: Both metrics are calculated for each model. R² measures the proportion of variance in the target variable explained by the model. AIC balances goodness of fit against model complexity, and lower values indicate a better model.
Best Model Selection: The function selects the model with the highest R² and returns its predictor name, R², and AIC in best_model_info. R² and AIC for every model are printed as progress messages while the loop runs.
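One point worth checking when comparing AIC across these models: missing values are removed separately for each predictor, so the models may be fitted on different numbers of rows, and AIC values are only comparable between models fitted to the same observations. The sketch below illustrates this on simulated data and shows one possible remedy, restricting all models to the rows that are complete for every predictor. This is an assumption about how one might handle it, not part of the function above; toy, x1, x2 and y are made-up names.

```r
# Simulated data with missing values in one predictor (illustrative only)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- toy$x1 + 0.5 * toy$x2 + rnorm(n)
toy$x2[sample(n, 50)] <- NA   # x2 is missing in a quarter of the rows

# Per-predictor NA removal: the two models see different numbers of rows
fit1 <- lm(y ~ x1, data = na.omit(toy[, c("y", "x1")]))
fit2 <- lm(y ~ x2, data = na.omit(toy[, c("y", "x2")]))
c(n_x1 = nobs(fit1), n_x2 = nobs(fit2))   # 200 vs 150 observations

# Common complete cases: both models are fitted to the same rows
common <- na.omit(toy[, c("y", "x1", "x2")])
fit1c <- lm(y ~ x1, data = common)
fit2c <- lm(y ~ x2, data = common)
c(AIC_x1 = AIC(fit1c), AIC_x2 = AIC(fit2c))
```

When all single-predictor models are fitted to the same rows, they have the same number of parameters, so ranking by lowest AIC and ranking by highest R² give the same order. The two criteria can only disagree here through differing sample sizes.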